{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "IDdZSPcLtKx4"
      },
      "source": [
        "##### Copyright 2019 The TensorFlow Hub Authors.\n",
        "\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "\n",
        "Created by @[Tahsin-Mayeesha](https://github.com/Tahsin-Mayeesha) for [Google Summer of Code](https://summerofcode.withgoogle.com/) 2019"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "-g5By3P4tavy"
      },
      "outputs": [],
      "source": [
        "# Copyright 2019 The TensorFlow Hub Authors. All Rights Reserved.\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     http://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS, \n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License.\n",
        "# =============================================================================="
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vpaLrN0mteAS"
      },
      "source": [
        "# Bangla Article Classification With TF-Hub"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GhN2WtIrBQ4y"
      },
      "source": [
        "This colab is a demonstration of using [Tensorflow Hub](https://www.tensorflow.org/hub/) for text classification in non-English/local languages. Here we choose [Bangla](https://en.wikipedia.org/wiki/Bengali_language) as the local language and use pretrained word embeddings to solve a multiclass classification task where we classify Bangla news articles in 5 categories.  The pretrained embeddings for Bangla comes from [fastText](https://fasttext.cc/docs/en/crawl-vectors.html) which is a library by Facebook with released pretrained word vectors for 157 languages. \n",
        "\n",
        "We'll use TF-Hub's pretrained embedding exporter for converting the word embeddings to a text embedding module first and then use the module to train a classifier with [tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras), Tensorflow's high level user friendly API to build deep learning models.  Even if we are using fastText embeddings here, it's possible to export any other embeddings pretrained from other tasks and quickly get results with Tensorflow hub. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Q4DN769E2O_R"
      },
      "source": [
        "# Prepare Environment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "zA07b51AGF5l"
      },
      "outputs": [],
      "source": [
        "!pip install -q tensorflow-gpu==2.0.0-beta1"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "zSeyZMq-BYsu"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow_hub as hub\n",
        "import numpy as np\n",
        "import os\n",
        "from sklearn.metrics import classification_report\n",
        "\n",
        "import matplotlib.pyplot as plt\n",
        "%matplotlib inline\n",
        "import seaborn as sns\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9FB7gLU4F54l"
      },
      "source": [
        "# Dataset\n",
        "\n",
        "We will use [BARD](https://www.researchgate.net/publication/328214545_BARD_Bangla_Article_Classification_Using_a_New_Comprehensive_Dataset) (Bangla Article Dataset) which has around 3,76,226 articles collected from different Bangla news portals and labelled with 5 categories : economy, state, international, sports and entertainment. \n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "P2YW4GGa9Y5o"
      },
      "outputs": [],
      "source": [
        "!wget -O bard.zip https://github.com/Tahsin-Mayeesha/bard-dataset/blob/master/data.zip?raw=true"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "dZcoOH7z9jF-"
      },
      "outputs": [],
      "source": [
        "!unzip -q bard.zip\n",
        "!ls"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "js75OARBF_B8"
      },
      "source": [
        "# Export pretrained word vectors to TF-Hub module"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "-uAicYA6vLsf"
      },
      "source": [
        "TF-Hub provides some handy scripts for converting word embeddings to TF-hub text embedding modules [here](https://github.com/tensorflow/hub/tree/master/examples/text_embeddings_v2). To make the module for Bangla or any other languages we simply have to download the word embedding .txt or .vec file to the same directory as export_v2.py and run the script.\n",
        "\n",
        "\n",
        "The exporter reads the embedding vectors and exports it to a Tensorflow [SavedModel](https://www.tensorflow.org/beta/guide/saved_model). A SavedModel contains a complete TensorFlow program including weights and graph. TF-Hub can load the SavedModel as a [module](https://www.tensorflow.org/hub/api_docs/python/hub/Module) which we will use to build the model for text classification. Since we are using tf.keras to build the model we will use [hub.KerasLayer](https://www.tensorflow.org/hub/api_docs/python/hub/KerasLayer) which provides a wrapper for a hub module to use as a Keras Layer.\n",
        "\n",
        "First we will get our word embeddings from fastText and embedding exporter from TF-Hub [repo](https://github.com/tensorflow/hub).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "5DY5Ze6pO1G5"
      },
      "outputs": [],
      "source": [
        "!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.bn.300.vec.gz\n",
        "!wget https://raw.githubusercontent.com/tensorflow/hub/master/examples/text_embeddings_v2/export_v2.py\n",
        "!gunzip cc.bn.300.vec.gz --k"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "PAzdNZaHmdl1"
      },
      "source": [
        "Then we will run the exporter script on our embedding file. Since fastText embeddings has a header line and are pretty large(around 3.3 GB for bangla after converting to a module) we ignore the first line and export only the first 100, 000 tokens to the text embedding module."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Tkv5acr_Q9UU"
      },
      "outputs": [],
      "source": [
        "!python export_v2.py --embedding_file=cc.bn.300.vec --export_path=text_module --num_lines_to_ignore=1 --num_lines_to_use=100000"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "k9WEpmedF_3_"
      },
      "outputs": [],
      "source": [
        "module_path = \"text_module\"\n",
        "embedding_layer = hub.KerasLayer(module_path, trainable=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fQHbmS_D4YIo"
      },
      "source": [
        "The text embedding module takes a batch of sentences in a 1D tensor of strings as input and outputs the embedding vectors of shape (batch_size, embedding_dim) corresponding to the sentences. It preprocesses the input by splitting on spaces. Word embeddings are combined to sentence embeddings with the `sqrtn` combiner(See [here](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup_sparse)). For demonstration we pass a list of Bangla words as input and get the corresponding embedding vectors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Z1MBnaBUihWn"
      },
      "outputs": [],
      "source": [
        "embedding_layer(['বাস', 'বসবাস', 'ট্রেন', 'যাত্রী', 'ট্রাক']) "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "4KY8LiFOHmcd"
      },
      "source": [
        "# Convert to Tensorflow Dataset \n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "pNguCDNe6bvz"
      },
      "source": [
        "Since the dataset is really large instead of loading the entire dataset in memory we will use a generator to yield samples in run-time in batches using [Tensorflow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) functionalities.  The dataset is also very imbalanced, so before using the generator we will shuffle the dataset. \n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "bYv6LqlEChO1"
      },
      "outputs": [],
      "source": [
        "dir_names = ['economy', 'sports', 'entertainment', 'state', 'international']\n",
        "\n",
        "file_paths = []\n",
        "labels = []\n",
        "for i, dir in enumerate(dir_names):\n",
        "  file_names = [\"/\".join([dir, name]) for name in os.listdir(dir)]\n",
        "  file_paths += file_names\n",
        "  labels += [i] * len(os.listdir(dir))\n",
        "  \n",
        "np.random.seed(42)\n",
        "permutation = np.random.permutation(len(file_paths))\n",
        "\n",
        "file_paths = np.array(file_paths)[permutation]\n",
        "labels = np.array(labels)[permutation]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "8b-UtAP5TL-W"
      },
      "source": [
        "We can check the distribution of labels in the training and validation examples after shuffling."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "4BNXFrkotAYu"
      },
      "outputs": [],
      "source": [
        "# plot training vs validation distribution\n",
        "train_size = int(len(file_paths) * 0.80)\n",
        "plt.subplot(1, 2, 1)\n",
        "plt.hist(labels[0:train_size])\n",
        "plt.title(\"Train labels\")\n",
        "plt.subplot(1, 2, 2)\n",
        "plt.hist(labels[train_size:])\n",
        "plt.title(\"Validation labels\")\n",
        "plt.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "RVbHb2I3TUNA"
      },
      "source": [
        "To create a [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) using generator we first write a generator function which reads each of the articles from file_paths and the labels from the label array, and yields one training example at each step. We pass this generator function to the [tf.data.Dataset.from_generator](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator) method and specify the output types. Each training example is a tuple containing an article of tf.string data type and one-hot encoded label. We split the dataset with a train-validation split of 80-20 using the [`skip`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#skip) and [`take`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take) method."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "eZRGTzEhUi7Q"
      },
      "outputs": [],
      "source": [
        "def generator():\n",
        "  for i, file in enumerate(file_paths):\n",
        "      label = tf.keras.utils.to_categorical(labels[i], num_classes=5)\n",
        "      article = tf.io.read_file(file)\n",
        "      yield article, label"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "2g4nRflB7fbF"
      },
      "outputs": [],
      "source": [
        "def make_dataset(train_size):\n",
        "  data = tf.data.Dataset.from_generator(generator=generator, \n",
        "                                        output_types=(tf.string, tf.float32))\n",
        "  train_size = int(train_size * len(file_paths))\n",
        "  train_data = data.take(train_size)\n",
        "  validation_data = data.skip(train_size)\n",
        "  return train_data, validation_data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "8PuuN6el8tv9"
      },
      "outputs": [],
      "source": [
        "train_data, validation_data = make_dataset(0.80)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MrdZI6FqPJNP"
      },
      "source": [
        "# Model Training and Evaluation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "jgr7YScGVS58"
      },
      "source": [
        "Since we have already added a wrapper around our module to use it as any other layer in keras we can create a small [Sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) model which is a linear stack of layers. We can add our text embedding module with `model.add` just like any other layer. We compile the model by specifying the loss and optimizer and train it for 10 epochs. tf.keras API can handle tensorflow datasets as input, so we can pass a Dataset instance to the fit method for model training. Since we are using a generator function, tf.data will handle generating the samples, batching them and feeding them to the model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "WhCqbDK2uUV5"
      },
      "source": [
        "## Model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "nHUw807XPPM9"
      },
      "outputs": [],
      "source": [
        "def create_model():\n",
        "  model = tf.keras.Sequential()\n",
        "  model.add(embedding_layer)\n",
        "  model.add(tf.keras.layers.Dense(64, activation=\"relu\"))\n",
        "  model.add(tf.keras.layers.Dense(16, activation=\"relu\"))\n",
        "  model.add(tf.keras.layers.Dense(5, activation=\"softmax\"))\n",
        "  model.compile(loss=\"categorical_crossentropy\", optimizer=\"adam\", metrics=['accuracy'])\n",
        "  return model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "5J4EXJUmPVNG"
      },
      "outputs": [],
      "source": [
        "model = create_model()\n",
        "# Create earlystopping callback\n",
        "early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=3)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZZ7XJLg2u2No"
      },
      "source": [
        "## Training"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "OoBkN2tAaXWD"
      },
      "outputs": [],
      "source": [
        "batch_size = 256\n",
        "history = model.fit(train_data.batch(batch_size), \n",
        "                    validation_data=validation_data.batch(batch_size), \n",
        "                    epochs=5, \n",
        "                    callbacks=[early_stopping_callback])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XoDk8otmMoT7"
      },
      "source": [
        "## Evaluation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "G5ZRKGOsXEh4"
      },
      "source": [
        "We can visualize the accuracy and loss curves for training and validation data using the `history` object returned by the `fit` method which contains the loss and accuracy value for each epoch."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "V6tOnByIOeGn"
      },
      "outputs": [],
      "source": [
        "# Plot training \u0026 validation accuracy values\n",
        "plt.plot(history.history['accuracy'])\n",
        "plt.plot(history.history['val_accuracy'])\n",
        "plt.title('Model accuracy')\n",
        "plt.ylabel('Accuracy')\n",
        "plt.xlabel('Epoch')\n",
        "plt.legend(['Train', 'Test'], loc='upper left')\n",
        "plt.show()\n",
        "\n",
        "# Plot training \u0026 validation loss values\n",
        "plt.plot(history.history['loss'])\n",
        "plt.plot(history.history['val_loss'])\n",
        "plt.title('Model loss')\n",
        "plt.ylabel('Loss')\n",
        "plt.xlabel('Epoch')\n",
        "plt.legend(['Train', 'Test'], loc='upper left')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9DeGZFXsJt5g"
      },
      "source": [
        "## Saving model\n",
        "\n",
        "After training the model we can export it as a [SavedModel](https://www.tensorflow.org/beta/guide/saved_model) to deploy or share with others."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "rIO_CseWJtJP"
      },
      "outputs": [],
      "source": [
        "tf.saved_model.save(model, export_dir=\"my_model\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "D54IXLqcG8Cq"
      },
      "source": [
        "## Prediction\n",
        "\n",
        "We can get the predictions for the validation data and check the confusion matrix to see the model's performance for each of the 5 classes. As `predict` method returns us the n-d array for probabilities for each class which we convert to class labels using `np.argmax`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "dptEywzZJk4l"
      },
      "outputs": [],
      "source": [
        "model =  tf.keras.models.load_model(\"my_model\")\n",
        "y_pred = model.predict(validation_data.batch(batch_size))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "7Dzeml6Pk0ub"
      },
      "outputs": [],
      "source": [
        "y_pred = np.argmax(y_pred, axis=1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "T4M3Lzg8jHcB"
      },
      "outputs": [],
      "source": [
        "samples = file_paths[0:3]\n",
        "for i, sample in enumerate(samples):\n",
        "  f = open(sample)\n",
        "  text = f.read()\n",
        "  print(text[0:100])\n",
        "  print(\"True Class: \", sample.split(\"/\")[0])\n",
        "  print(\"Predicted Class: \", dir_names[y_pred[i]])\n",
        "  f.close()\n",
        "  "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "PlDTIpMBu6h-"
      },
      "source": [
        "## Compare Performance\n",
        "\n",
        "Now we can take the correct labels for the validation data from `labels`  and compare it with our predictions to get the [classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html). "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "mqrERUCS1Xn7"
      },
      "outputs": [],
      "source": [
        "y_true = np.array(labels[train_size:])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "NX5w-NuTKuVP"
      },
      "outputs": [],
      "source": [
        "print(classification_report(y_true, y_pred, target_names=dir_names))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "p5e9m3bV6oXK"
      },
      "source": [
        "We can also compare our model's performance with the published results obtained in the original [paper](https://www.researchgate.net/publication/328214545_BARD_Bangla_Article_Classification_Using_a_New_Comprehensive_Dataset) who report a 0.96 precision .The original authors described many preprocessing steps done on the dataset like dropping punctuations and digits, removing top 25 most frequest stop words. As we can see in the classification_report, we also gain a 0.96 precision and accuracy after training only 5 epochs without any preprocessing! \n",
        "\n",
        "In this example when we created the Keras layer from our embedding module we set `trainable=False`, which means the embedding weights will not be updated during training. Try setting it to True to reach 97% accuracy with this dataset with only 2 epochs. "
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "Bangla article classifier.ipynb",
      "provenance": [],
      "version": "0.3.2"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
